Generalized Additive Models in Fraud Detection

Data Science Capstone Project

Authors

Grace Allen, Kesi Allen, Sonya Melton, Pingping Zhou

Published

November 21, 2025

Slides

Literature Review

Introduction

Generalized Additive Models (GAMs) have emerged as a powerful extension of traditional regression methods, offering a balance between predictive flexibility and interpretability. Originally introduced by Hastie & Tibshirani (1986) and Hastie & Tibshirani (1990), GAMs build on the framework of Generalized Linear Models (GLMs) by replacing the strictly linear predictor with a sum of smooth, data-driven functions. This structure allows models to capture complex nonlinear relationships while preserving interpretability, making them especially valuable in fields where transparency is critical, including finance, healthcare, auditing, and cybersecurity. Their ability to represent nonlinear effects in a way that stakeholders and regulators can directly review has positioned GAMs as an important tool in modern statistical and machine learning applications.

The foundations of GAMs are grounded in penalized likelihood estimation and iteratively reweighted least squares (HalDa, 2012), while modern implementations such as the mgcv package in R (Wood, 2017, 2025) have greatly improved their efficiency, scalability, and robustness. Penalization techniques introduced by Wood (2017) allow smoothness control, prevent overfitting, and address issues such as concurvity, making GAMs well-suited for noisy or high-dimensional datasets. These developments have made GAMs increasingly practical for real-world applications. Transparency also remains central: as Zlaoui (2018) illustrates, GAMs provide interpretable risk curves that visualize how each feature influences an outcome, offering critical insight in high-stakes environments.

Applications of GAMs across different fields underscore their versatility. In ecology, they have been used to map species distributions and detect environmental thresholds (Detmer, 2025; Guisan et al., 2002). In biostatistics, they have informed studies of health outcomes such as alcohol use (White et al., 2020). In finance and auditing, GAMs have uncovered irregular revenue patterns and detected fraudulent Medicare billing, with results that auditors and regulators could interpret directly (Brossart et al., 2015; Miller, 2025). Even in challenging contexts where noisy or uneven data reduce precision, studies have shown that recall and interpretability remain strong advantages of the approach (Detmer, 2025; Guisan et al., 2002; Tragouda et al., 2024).

Building on these foundations, researchers have proposed several extensions and innovations. Functional and Dynamic GAMs account for functional predictors and temporal dependencies, enhancing model flexibility for forecasting and time-series applications (DGAM, 2021; FGAM, 2015). Neural-inspired variants such as Neural Additive Models (Agarwal et al., 2021) and GAMformer (GAMformer, 2023) integrate deep learning techniques, improving computational efficiency and extending the ability of GAMs to model complex nonlinear data. Bayesian approaches provide clearer ways to quantify uncertainty and guide variable selection (Miller, 2025). Other tools such as Gam.hp (2020) strengthen transparency by quantifying predictor contributions. Furthermore, Microsoft’s Explainable Boosting Machine explored by Lou et al. (2012) adapts the GAM framework to include limited interactions, improving predictive performance while retaining interpretability.

Research also highlights the role of GAMs within broader fraud detection strategies. In financial contexts, Tragouda et al. (2024) applied GAMs to bank cheque fraud, demonstrating high recall (77.8%) even when data imbalance reduced precision. Brossart et al. (2015) used GAMs to identify fraudulent Medicare billing, showing that interpretability helped build auditor trust despite challenges with adapting to emerging patterns. Miller (2025) combined GAMs with ensemble models such as random forests to detect irregular revenue in financial statements, producing visualizations auditors could use directly. Beyond GAMs, graph-based frameworks have emerged as complementary approaches. For example, Chang et al. (2022) introduced Graph Neural Additive Networks (GNANs), extending GAMs to graph-structured data such as transaction networks and achieving 84.5% ROC-AUC in detecting suspicious users. Zhang et al. (2025) demonstrated that GAMs could model sequential features in telecom fraud detection but were often outperformed by graph neural networks (GNNs) when modeling complex relational data.

In parallel, other interpretable machine learning techniques continue to shape the fraud detection landscape. Hanagandi et al. (2023) applied regularized generalized linear models, including Ridge, Lasso, and ElasticNet, to highly imbalanced credit card fraud datasets, achieving strong performance (up to 98.2% accuracy with Ridge regression) and showing that careful preprocessing is essential for real-time fraud detection. Generative approaches also contribute: Zhu et al. (2023) demonstrated how Generative Adversarial Networks (GANs) can generate synthetic transaction data to improve robustness against class imbalance. Collectively, these innovations expand the interpretability-performance frontier and highlight how transparent modeling frameworks, including GAMs and their extensions, remain central to modern fraud analytics.

The primary objectives of this analysis are to leverage the fraud detection transactions dataset to build and evaluate effective fraud detection models using Generalized Additive Models (GAMs). Specifically, the goals are:

  • Develop Robust Models: Construct models that accurately distinguish between fraudulent and legitimate transactions using GAMs.

  • Identify Key Features: Pinpoint significant variables that contribute to fraud risk, improving interpretability and providing actionable insights for financial institutions.

  • Provide Practical Insights: Generate findings that enhance anomaly detection, risk management, and financial security strategies, while addressing challenges such as noise and class imbalance.

In this study, we apply GAM methodology using RStudio and the mgcv package to the Fraud Detection Transactions Dataset from Kaggle (Ashar, 2024). This synthetic yet realistic dataset provides an opportunity to test GAMs in a controlled but meaningful context. The aim is to use generalized additive models to identify which variables can be used to predict fraudulent activity.

Methods

Generalized Additive Models (GAMs) extend traditional regression by allowing flexible, nonlinear relationships between predictors and the response variable. In the context of fraud detection, GAMs model the probability that a transaction is fraudulent as a smooth and interpretable function of key predictors such as transaction amount, account activity, and time of day. Continuous variables are represented with spline-based smooth functions to capture nonlinear patterns, while categorical variables are incorporated as factors. The model is fitted using the mgcv package in R, which applies penalized regression splines and generalized cross-validation (GCV) to optimize smoothness and prevent overfitting (Wood, 2017). After fitting, the smooth terms illustrate how each variable influences fraud likelihood, enabling visual interpretation of complex effects. Model performance is then evaluated using metrics such as AUC, accuracy, and recall, and the trained model is applied to the test dataset to identify fraudulent transactions.

The overall modeling process is summarized in the flow chart below, which outlines the key steps from data preparation through model evaluation and interpretation.

%%{init: {'theme': 'base', 'themeVariables': { 
  'background': '#FAFAF5',
  'primaryColor': '#4682B4',
  'secondaryColor': '#1E3A8A',
  'lineColor': '#1E3A8A',
  'nodeBorder': '#1E3A8A',
  'primaryTextColor': '#FFFFFF',
  'textColor': '#191970',
  'fontSize': '12px',
  'width': '100%'
}}}%%

flowchart TB
A["Data Preparation<br/>- Clean data<br/>- Encode categorical variables"] --> B["Exploratory Data Analysis<br/>- Check distributions<br/>- Identify predictors"]
B --> C["Split Data<br/>- Train/Test sets<br/>- Stratify by fraud outcome"]
C --> D["Specify GAM Model<br/>- Select predictors<br/>- Define smooth terms<br/>- Family = binomial"]
D --> E["Fit Model<br/>mgcv::gam(...)"]
E --> F["Evaluate Model<br/>- ROC/AUC<br/>- Confusion Matrix"]
F --> G["Interpret Results<br/>- Plot smooth effects"]
G --> H["Predict New Data<br/>- Apply model to test or new cases"]

style H fill:#FF4C4C,stroke:#8B0000,color:#FFFFFF
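The "Split Data" step in the flow chart can be sketched in base R. The helper below is our own illustration (the function name, the 80/20 split, and the simulated data are assumptions, not from the project code):

```r
# Illustrative stratified train/test split (helper name and 80/20 split
# are our own choices): sampling within each outcome class preserves the
# fraud rate in both the training and test sets.
split_stratified <- function(df, label_col, train_frac = 0.8, seed = 42) {
  set.seed(seed)
  idx <- unlist(lapply(split(seq_len(nrow(df)), df[[label_col]]),
                       function(i) sample(i, floor(train_frac * length(i)))))
  list(train = df[idx, , drop = FALSE], test = df[-idx, , drop = FALSE])
}

# Toy usage with a simulated imbalanced outcome
d <- data.frame(x = rnorm(1000), Fraud_Label = rbinom(1000, 1, 0.1))
parts <- split_stratified(d, "Fraud_Label")
sapply(parts, function(p) mean(p$Fraud_Label))  # similar fraud rates
```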

Equation

Formally, a GAM can be expressed as:

\[ g(\mu) = \alpha + s_1(X_1) + s_2(X_2) + \dots + s_p(X_p) \]

where \(g(\mu)\) is the link function (e.g., logit for binary outcomes or identity for continuous outcomes), \(\alpha\) is the intercept, and \(s_j(X_j)\) are smooth functions of the predictor variables \(X_j\) (HalDa, 2012). This structure allows each predictor to contribute a smoothed effect to the model, capturing complex patterns in the data without obscuring the individual influence of each variable. By balancing flexibility and clarity, GAMs offer a practical alternative to fully nonparametric methods, which can become computationally intensive and difficult to interpret (Hastie & Tibshirani, 1990). The additive smooth functions \(s_j(X_j)\) are at the heart of GAMs, enabling the model to uncover nonlinear patterns while maintaining interpretability for each predictor (HalDa, 2012).
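As a quick numerical illustration of the logit link in the equation above (the term values below are made up), the fitted probability is obtained by applying the inverse link to the additive predictor:

```r
# Inverse-logit illustration with made-up term values: eta is the additive
# predictor alpha + s1(X1) + s2(X2); plogis() is the inverse of the logit
# link, mapping eta to a fraud probability in (0, 1).
eta <- 0.5 + (-1.2) + 0.3   # alpha = 0.5, s1(x1) = -1.2, s2(x2) = 0.3
mu  <- plogis(eta)          # exp(eta) / (1 + exp(eta))
round(mu, 4)                # -> 0.4013
```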

Assumptions

The model assumes a link function connecting the mean of the response to the additive predictor; for fraud detection, this typically means a logit link to model the probability that a transaction is fraudulent. The effects of the predictors are additive, meaning each variable contributes independently, and the overall prediction is the sum of these individual effects. Observations are assumed to be independent, so one transaction does not influence another, and each case is treated separately (Wood, 2017). The model also assumes that relationships change smoothly; as a predictor changes, its effect on fraud risk evolves gradually rather than abruptly. The response variable is assumed to follow a known distribution, which in this project is binomial since the outcome is either fraud or non-fraud. Smoothness settings and penalty values are chosen to allow the model to capture real patterns without overfitting the data. Finally, predictors are assumed not to be highly correlated with one another (low concurvity), which ensures that the model can estimate each variable’s effect clearly (Hastie & Tibshirani, 1986).
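The final assumption, low correlation (concurvity) among predictors, can be checked directly with mgcv::concurvity(); the simulated data below is our own illustration, not the fraud dataset:

```r
library(mgcv)

# Simulated illustration (our own data): concurvity() reports, for each
# smooth, how well it could be represented by the other model terms;
# 0 means no concurvity, values near 1 flag a problem.
set.seed(1)
d <- data.frame(x1 = runif(300), x2 = runif(300))
d$y <- rbinom(300, 1, plogis(2 * d$x1 - 1))
m <- gam(y ~ s(x1) + s(x2), family = binomial, data = d)
round(concurvity(m, full = TRUE), 3)   # independent predictors -> low values
```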

Justification of Model Choice

Generalized Additive Models (GAMs) are used for this fraud detection project because they balance predictive flexibility, interpretability, and statistical rigor. Fraud detection involves nonlinear relationships, rare events, and high stakes for transparency. GAMs are well-suited to address these challenges while providing interpretable insights.

Alignment with Fraud Data Characteristics

Fraudulent transaction data often exhibits nonlinear effects among predictors such as transaction amount, account balance, risk score, and temporal activity. For example, fraud probability may increase sharply beyond certain transaction thresholds or vary depending on account age or device type. Linear models cannot capture these patterns without arbitrary transformations, whereas GAMs use smooth spline functions to model each predictor’s contribution automatically. Class imbalance, where fraud cases are rare, is handled via penalized splines, preventing overfitting and enabling robust generalization (Wood, 2017).
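The threshold behavior described above can be demonstrated on simulated data (ours, not the fraud dataset): a linear logistic model misses a sharp jump in fraud risk that a penalized spline captures:

```r
library(mgcv)

# Simulated threshold effect (our own data): fraud risk jumps sharply
# above amount = 700, a pattern a linear logit cannot represent.
set.seed(3)
amount <- runif(2000, 0, 1000)
fraud  <- rbinom(2000, 1, plogis(-3 + 4 * (amount > 700)))

lin  <- glm(fraud ~ amount, family = binomial)       # strictly linear logit
gam1 <- gam(fraud ~ s(amount), family = binomial)    # penalized spline

c(linear_AIC = AIC(lin), gam_AIC = AIC(gam1))        # spline fits much better
```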

Interpretability and Transparency

GAMs produce visualizations showing how each feature affects fraud probability, allowing stakeholders to see whether increasing a risk score or transaction amount raises or lowers fraud likelihood. This transparency is critical for financial compliance under frameworks like GDPR and AI transparency mandates. GAMs provide strong predictive performance while generating insights that can be clearly communicated to non-technical audiences.

Theoretical and Practical Advantages

GAMs extend Generalized Linear Models (GLMs) by relaxing strict linearity assumptions and incorporating additive smooth functions. This flexibility allows the model to adapt to complex real-world data without losing interpretability (Wood, 2025). Modern implementations in R, such as mgcv, offer automatic smoothness estimation and robust fitting through Generalized Cross-Validation (Wood, 2025), making GAMs scalable to large datasets (Ashar, 2024).

Relevance to Fraud Analytics

Smooth risk curves produced by GAMs highlight critical inflection points where fraud risk increases, enabling investigators to understand and communicate predictions. Studies demonstrate GAMs’ effectiveness in fraud detection and financial auditing, showing high recall rates and enhanced confidence in identifying irregular records (Brossart et al., 2015; Tragouda et al., 2024). While more complex methods like deep learning may achieve higher raw accuracy, GAMs maintain interpretability and can serve as benchmark or surrogate models for black-box algorithms, supporting strong operational performance in general. GAMs provide a modeling framework that balances accuracy, interpretability, and accountability, making them well-suited for effective and trustworthy fraud detection.

Analysis and Results

Data Exploration and Visualization

Data set Description

The Fraud Detection Transactions Dataset (Ashar, 2024) is a meticulously crafted, synthetic dataset that replicates real-world financial transaction patterns, making it a robust resource for building and testing fraud detection models. Hosted on Kaggle, it is tailored for binary classification tasks, with transactions labeled as fraudulent (1) or non-fraudulent (0), and is designed to simulate the complexity of financial systems while ensuring ethical data usage by avoiding real user information. The dataset’s realistic design captures nuanced fraud patterns, such as clustered fraudulent transactions, subtle anomalies, or irregular user behaviors, providing a challenging yet representative environment for machine learning applications in anomaly detection, risk assessment, and fraud prevention.

The dataset contains 50,000 records with a mix of typical transactions and rare fraudulent events, reflecting the class imbalance that is a common challenge in fraud detection. Potential data quality issues, such as noise, missing values, or outliers, mirror real-world complexities and require preprocessing steps like data cleaning, categorical encoding, or normalization. These challenges necessitate robust modeling techniques to handle noise and ensure accurate predictions.
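The preprocessing steps mentioned above might look like the following sketch; the helper name is ours, and the column list follows the dataset's naming but is illustrative:

```r
# Illustrative preprocessing helper (name and column list are ours):
# coerce categorical columns to factors, as mgcv expects for parametric
# terms, and drop incomplete rows.
prepare_fraud_data <- function(df) {
  cat_cols <- intersect(
    c("Transaction_Type", "Merchant_Category", "Device_Type",
      "Card_Type", "Authentication_Method"),
    names(df))
  df[cat_cols] <- lapply(df[cat_cols], factor)
  df[complete.cases(df), , drop = FALSE]
}

# Toy usage: the NA row is dropped, Transaction_Type becomes a factor
toy <- data.frame(Transaction_Type = c("POS", "Online", NA),
                  Transaction_Amount = c(10, 250, 40))
prepare_fraud_data(toy)
```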

Key Characteristics

The dataset simulates real-world financial transaction patterns, capturing diverse user behaviors and transaction characteristics while ensuring ethical data usage through its synthetic design. It is tailored for binary classification tasks, with transactions labeled as fraudulent (1) or non-fraudulent (0), and includes 50,000 rows of data with 21 features categorized as follows:

  • Size and Scope: Contains 50,000 individual transactions, each labeled as either fraudulent (1) or non-fraudulent (0).

  • Features (21 total):

    • Numerical variables: transaction amounts, risk scores, balances, and other continuous measures.

    • Categorical variables: transaction types (e.g., payment, transfer, withdrawal), device types, and merchant categories.

    • Temporal variables: transaction time, day, and sequencing patterns that capture behavioral dynamics.

  • Label Distribution: Fraudulent transactions represent a small percentage of the data, reflecting the real-world class imbalance in fraud detection problems.

  • Realism: Although synthetic, the dataset mirrors real-world fraud scenarios by including behavioral signals, unusual spending patterns, and high-risk profiles.

  • Flexibility: Supports various modeling approaches, from interpretable methods (e.g., GAMs, logistic regression) to high-performance ensemble models (e.g., XGBoost).

Visualizations

Tables 1, 2, and 3 display the counts for our categorical variables. While the dataset is synthetic and the categories are relatively evenly distributed, generalized additive models (GAMs) remain an appropriate analytical approach. GAMs provide the flexibility to model complex, nonlinear relationships between predictors and outcomes, accommodating both categorical and continuous variables. The even distribution of categories in the synthetic data does not compromise the validity of GAMs; it primarily affects the interpretability of specific category effects rather than the model’s overall applicability. Therefore, GAMs can still yield meaningful insights into the underlying patterns and relationships within this dataset.

Code
# Load libraries
library(tidyverse)
library(janitor)
library(gt)
library(scales)

# === Load dataset ===
data_path <- "synthetic_fraud_dataset.csv"
df <- readr::read_csv(data_path, show_col_types = FALSE) |>
  clean_names()

# === Create count tables ===
tbl_type <- df |>
  count(transaction_type, name = "Count") |>
  arrange(desc(Count)) |>
  rename(Type = transaction_type)

tbl_device <- df |>
  count(device_type, name = "Count") |>
  arrange(desc(Count)) |>
  rename(Device = device_type)

tbl_merchant <- df |>
  count(merchant_category, name = "Count") |>
  arrange(desc(Count)) |>
  rename(Merchant_Category = merchant_category)

# === Blue Theme for gt Tables ===
style_blue_gt <- function(.data, title_text) {
  .data |>
    gt() |>
    tab_header(title = md(title_text)) |>
    fmt_number(columns = "Count", decimals = 0, sep_mark = ",") |>
    tab_options(
      table.font.names = "Arial",
      table.font.size  = 14,
      data_row.padding = px(6),
      heading.align    = "left",
      table.border.top.color    = "darkblue",
      table.border.top.width    = px(3),
      table.border.bottom.color = "darkblue",
      table.border.bottom.width = px(3)
    ) |>
    tab_style(
      style = list(cell_fill(color = "darkblue"),
                   cell_text(color = "white", weight = "bold")),
      locations = cells_title(groups = "title")
    ) |>
    tab_style(
      style = list(cell_fill(color = "steelblue"),
                   cell_text(color = "white", weight = "bold")),
      locations = cells_column_labels(everything())
    ) |>
    opt_row_striping() |>
    cols_align("right", columns = "Count")
}

# === Render all three blue tables ===
style_blue_gt(tbl_type, "Table 1 – Transaction Types and Counts")
Table 1 – Transaction Types and Counts
Type Count
POS 12,549
Online 12,546
ATM Withdrawal 12,453
Bank Transfer 12,452
Code
style_blue_gt(tbl_device, "Table 2 – Device Types and Counts")
Table 2 – Device Types and Counts
Device Count
Tablet 16,779
Mobile 16,640
Laptop 16,581
Code
style_blue_gt(tbl_merchant, "Table 3 – Merchant Categories and Counts")
Table 3 – Merchant Categories and Counts
Merchant_Category Count
Clothing 10,033
Groceries 10,019
Travel 10,015
Restaurants 9,976
Electronics 9,957

Distribution of Variables

Figure 1 includes histograms that provide a visual overview of key variables used in fraud detection modeling, revealing important patterns that inform feature engineering and model selection. Most features—such as Account_Balance, Transaction_Distance, Risk_Score, and Card_Age—exhibit uniform distributions, indicating evenly spread values that are well-suited for capturing nonlinear effects in models like GAM. In contrast, Transaction_Amount shows a strongly right-skewed distribution, with most transactions involving small amounts and a few high-value outliers concentrated in the far right tail. This skewed pattern suggests that fraudulent behavior may cluster around extreme transaction amounts, making this feature particularly predictive. To address the skewness and better model the curved fraud-risk pattern across transaction sizes, a log transformation or a nonlinear modeling approach like GAM can help stabilize variance and enhance detection accuracy.

Figure 1.

Code
library(tidyverse)
library(lubridate)
library(patchwork)  # for arranging multiple ggplots

# Load dataset
fraud_data <- read.csv("synthetic_fraud_dataset.csv")

# Convert Timestamp to date and calculate Issuance_Year if needed
fraud_data <- fraud_data %>%
  mutate(
    Timestamp = ymd_hms(Timestamp, quiet = TRUE),  # adjust format if needed
    Transaction_Year = year(Timestamp),
    Issuance_Year = Transaction_Year - Card_Age
  ) %>%
  filter(!is.na(Card_Age))  # remove rows with NA in Card_Age

# Variables to plot (move Transaction_Amount to last)
numeric_vars <- c("Account_Balance", "Transaction_Distance", "Risk_Score", "Card_Age", "Transaction_Amount")

# Create a list to store plots
plot_list <- list()

# Generate plots and store in the list
for (var in numeric_vars) {
  p <- ggplot(fraud_data, aes(x = .data[[var]])) +  # .data pronoun replaces deprecated aes_string()
    geom_histogram(fill = "steelblue", color = "white", bins = 30) +
    labs(title = paste("Distribution of", var),
         x = var,
         y = "Count") +
    theme_light()
  
  plot_list[[var]] <- p
}

# Arrange plots in a grid: 2 plots per row
(plot_list[[1]] | plot_list[[2]]) /
(plot_list[[3]] | plot_list[[4]]) /
plot_list[[5]]  # Transaction_Amount appears last
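One hedged way to address the right skew in Transaction_Amount noted above is a log1p transform (log(1 + x), which stays defined at zero); within a GAM this could enter as s(log1p(Transaction_Amount)). The simulated amounts below are our own:

```r
# Moment-based skewness (small helper of our own, avoids extra packages)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Simulated right-skewed transaction amounts (mean ~200); log1p greatly
# reduces the skew before smoothing.
set.seed(7)
amounts <- rexp(5000, rate = 1 / 200)
c(raw = skew(amounts), logged = skew(log1p(amounts)))
```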

Figure 2 provides insight into the card age variable. Card age shows a left-skewed distribution: many cards are relatively new, with fewer older cards. Older cards (e.g., issued in 2015–2017) may be more vulnerable if their security features are outdated, while newer cards (e.g., 2023–2024) might show different usage patterns, possibly more digital or mobile transactions. Peaks in certain years could reflect onboarding campaigns or fraud targeting specific cohorts. This suggests that fraud risk may vary by card maturity: new cards could face higher risk due to unfamiliar usage patterns. GAM smooth terms can model such non-monotonic age–fraud relationships.

Code
library(tidyverse)
library(lubridate)
# Load libraries
library(ggplot2)
library(dplyr)
library(tidyr)    # For pivot_longer
library(gridExtra) # For arranging plots
#install.packages("moments") 
library(moments)   # For skewness and kurtosis
# Load dataset
fraud_data <- read.csv("synthetic_fraud_dataset.csv")

# Convert Timestamp to date, calculate Transaction Year and Issuance Year, exclude NAs
fraud_data <- fraud_data %>%
  mutate(
    Timestamp = ymd_hms(Timestamp),               # adjust if format differs
    Transaction_Year = year(Timestamp),
    Issuance_Year = Transaction_Year - Card_Age
  ) %>%
  filter(!is.na(Issuance_Year), !is.na(Card_Age))  # remove rows with NA

# Bin Issuance Year into 5-year ranges and drop unused NA factor levels
fraud_data <- fraud_data %>%
  mutate(
    Issuance_Year_Bin = cut(Issuance_Year,
                             breaks = seq(2000, 2025, by = 5),
                             right = FALSE,
                             labels = c("2000-2004","2005-2009","2010-2014","2015-2019","2020-2024"))
  ) %>%
  filter(!is.na(Issuance_Year_Bin))  # drop any rows that fall outside the bins

# Histogram
ggplot(fraud_data, aes(x = Issuance_Year_Bin)) +
  geom_bar(fill = "steelblue", color = "white") +
  labs(title = "Figure 2. Card Age by Issuance Year Range",
       x = "Card Issuance Year Range",
       y = "Count") +
  theme_light()

Figure 3 shows a nonlinear relationship between transaction amount and fraud probability, supporting the use of GAMs to flexibly model such effects. Transaction amount is a key continuous predictor, and this pattern illustrates the need for a flexible approach before analyzing the full set of variables.

Code
library(tidyverse)
# Load dataset
fraud_data <- read.csv("synthetic_fraud_dataset.csv")
# Ensure Fraud_Label is numeric (0/1)
fraud_data <- fraud_data %>%
  mutate(Fraud_Label = as.numeric(Fraud_Label))

# Nonlinearity check: Transaction Amount vs Fraud Probability
ggplot(fraud_data, aes(x = Transaction_Amount, y = Fraud_Label)) +
  geom_smooth(method = "loess", se = FALSE, color = "darkblue") +
  labs(title = "Figure 3. Transaction Amount and Fraud Probability",
       x = "Transaction Amount",
       y = "Fraud Probability") +
  theme_light()

Modeling and Results

The modeling process uses a logistic binomial Generalized Additive Model (GAM) to assess which variables in the dataset meaningfully predict fraudulent transactions. The purpose of this model is not only to estimate fraud probabilities, but also to evaluate the contribution of each predictor. The GAM framework allows continuous variables to have flexible shapes, which helps reveal whether their effects are linear, nonlinear, or negligible. Model performance metrics, including the confusion matrix and ROC curve, are included to summarize how well the selected predictors explain fraud patterns.

The primary goal of this analysis is to test which variables serve as reliable predictors of fraudulent transactions. By examining the smooth terms and their significance, we can determine which features meaningfully influence fraud risk and which provide little or no predictive value.
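Because evaluation relies on ROC/AUC, a package-free sketch of the AUC computation may be helpful; the rank-statistic formula is standard, but the function name is ours:

```r
# Package-free AUC sketch (function name is ours): AUC equals the
# probability that a randomly chosen fraud case receives a higher score
# than a randomly chosen legitimate one, computed here from the
# Wilcoxon rank statistic.
auc_rank <- function(labels, scores) {
  r  <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_rank(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))   # -> 0.75
```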

Assumptions

The fitted GAM was evaluated using several diagnostic checks to ensure that the model meets the assumptions required for reliable interpretation. The basis-dimension results show that all three smooth terms (Transaction_Amount, Account_Balance, and Card_Age) have k-indices close to 1 with non-significant p-values. This indicates that the spline basis for each smooth is appropriate and is neither under- nor over-smoothed.

The residual diagnostics also support the adequacy of the model. The residuals-versus-fitted plot does not show any strong patterns or curvature, which suggests that the additive structure of the model is reasonable. The Q-Q plot shows some deviation in the extreme tails, which is expected in an imbalanced fraud dataset, but there is no major departure from the theoretical quantiles. The histogram of residuals reflects the expected skew for a binomial response, and the ACF plot shows no meaningful autocorrelation, which supports the assumption of independent residuals.

The smooth functions themselves appear well-behaved. None of the continuous predictors show unusually complex or unstable patterns, and the estimated degrees of freedom are close to 1. This suggests that each predictor has an essentially linear relationship with the log-odds of fraud after adjusting for the other variables. The smooth functions also widen in areas with limited data, which is normal.

Overall, the diagnostic checks indicate that the GAM is well-specified for this dataset. The smoothing parameters are appropriate, the residual behavior aligns with model assumptions, and there is no evidence of problematic dependence among predictors. These results suggest that the model is stable and reliable for interpreting how transaction-related features relate to fraud probability.

Code
#| label: gam-assumptions
#| echo: false
#| warning: false
#| message: false

library(mgcv)
library(dplyr)

# Load data

data <- read.csv("synthetic_fraud_dataset.csv")
data$Fraud_Label <- factor(data$Fraud_Label, levels = c(0,1))

# Fit GAM model (same one used in analysis)

gam_model <- gam(
Fraud_Label ~
Merchant_Category +
Is_Weekend +
s(Transaction_Amount) +
s(Account_Balance) +
s(Card_Age),
family = binomial(link = "logit"),
data = data,
method = "REML"
)

# 1. Basis dimension check

invisible(capture.output(gam_check_results <- gam.check(gam_model)))

Code
# 2. Diagnostic plots

par(mfrow = c(2,2))

# Smooth terms

plot(gam_model, pages = 1, se = TRUE, rug = TRUE)

Code
# Residuals vs fitted

# Set a 2 x 2 plotting layout (on.exit() has no effect at top level;
# the layout is reset to a single panel in a later chunk)
par(mfrow = c(2, 2))

# Residuals vs Fitted
plot(gam_model$residuals ~ gam_model$fitted.values,
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0)

# Normal Q–Q plot
qqnorm(gam_model$residuals, main = "Normal Q-Q Plot")
qqline(gam_model$residuals)

# ACF of residuals
acf(gam_model$residuals, main = "ACF of Residuals")

# Prevent printing of 'NULL' or device numbers
invisible(NULL)

GAM Analysis for Numeric Variables

The smooth term analysis in Figure 10 establishes a hierarchy for predicting fraud. Risk_Score is the dominant predictor, showing a sharp increase in the log-odds of fraud once the score exceeds roughly 0.75. Transaction_Amount is a secondary predictor, displaying a consistent positive relationship where larger transactions increase risk. Account_Balance and Card_Age show negligible predictive power, with flat smooth terms near zero. Fraud detection efforts should therefore focus on transactions with high risk scores and large amounts, while account balance and card age are less informative.

Figure 10.

Code
#install.packages("caret")
library(mgcv)
library(dplyr)
library(caret)
# Load your data
data <- read.csv("synthetic_fraud_dataset.csv")
# Build GAM model with Risk_Score included as a smooth term
gam_model <- gam(Fraud_Label ~
                Merchant_Category +
                Is_Weekend +
                s(Transaction_Amount) +
                s(Account_Balance) +
                s(Card_Age) +
                s(Risk_Score),
                family = binomial(link = "logit"),
                data = data)
par(mfrow = c(2, 2), mar = c(4, 4, 3, 2))
plot(gam_model, select = 1, shade = TRUE, col = "blue", lwd = 2,
    shade.col = "lightblue", main = "s(Transaction_Amount)")
plot(gam_model, select = 2, shade = TRUE, col = "green4", lwd = 2,
    shade.col = "lightgreen", main = "s(Account_Balance)")
plot(gam_model, select = 3, shade = TRUE, col = "purple", lwd = 2,
    shade.col = "plum", main = "s(Card_Age)")
plot(gam_model, select = 4, shade = TRUE, col = "red", lwd = 2,
    shade.col = "pink", main = "s(Risk_Score)")

Code
par(mfrow = c(1, 1))
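The final "Predict New Data" step described in the Methods flow chart can be sketched as follows; the data here are simulated (ours) and the 0.5 threshold is illustrative, not tuned on the project data:

```r
library(mgcv)

# Self-contained sketch of the predict step (simulated data, ours):
# fit a binomial GAM, then score new cases with type = "response",
# which returns fraud probabilities on the 0-1 scale.
set.seed(2)
train <- data.frame(amount = runif(500, 0, 1000))
train$fraud <- rbinom(500, 1, plogis(-4 + 0.006 * train$amount))
fit <- gam(fraud ~ s(amount), family = binomial, data = train)

new_cases <- data.frame(amount = c(50, 900))
predict(fit, newdata = new_cases, type = "response")  # fraud probabilities
```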

GAM Analysis for Categorical Variables

Table 4 presents a GAM using only categorical predictors to examine associations with fraud. Each estimate is the log-odds difference relative to the reference category, reported alongside the corresponding odds ratio (OR) and its 95% confidence interval. Standard errors indicate uncertainty, and an odds-ratio confidence interval that crosses 1 indicates no significant difference from the reference.

Code
library(mgcv)
library(broom)
library(dplyr)
library(knitr)
#install.packages("kableExtra")
library(kableExtra)


# Fit GAM with categorical predictors only
gam_cat <- gam(
  Fraud_Label ~ 
    Transaction_Type +
    Merchant_Category +
    Device_Type +
    Card_Type +
    Authentication_Method +
    Is_Weekend +
    IP_Address_Flag +
    Previous_Fraudulent_Activity,
  family = binomial(link = "logit"),
  data = data
)

# Get a tidy summary of parametric coefficients
tidy_gam_cat <- tidy(gam_cat, parametric = TRUE) %>%
  filter(term != "(Intercept)") %>%   # optional: remove intercept
  mutate(
    OR = exp(estimate),               # convert log-odds to odds ratio
    OR_low = exp(estimate - 1.96 * std.error),
    OR_high = exp(estimate + 1.96 * std.error)
  )

# Clean table for Quarto/HTML report
tidy_gam_cat %>%
  select(term, estimate, std.error, statistic, p.value, OR, OR_low, OR_high) %>%
  mutate(across(where(is.numeric), ~ round(., 3))) %>%  # round numbers nicely
  kable(
    format = "html",                     # use "latex" if rendering PDF
    caption = "Table 4. GAM Categorical Coefficients"
  ) %>%
  kable_styling(full_width = FALSE, position = "center")
Table 4. GAM Categorical Coefficients

| term | estimate | std.error | statistic | p.value | OR | OR_low | OR_high |
|------|----------|-----------|-----------|---------|----|--------|---------|
| Transaction_TypeBank Transfer | -0.018 | 0.027 | -0.677 | 0.498 | 0.982 | 0.931 | 1.035 |
| Transaction_TypeOnline | -0.016 | 0.027 | -0.604 | 0.546 | 0.984 | 0.933 | 1.037 |
| Transaction_TypePOS | -0.030 | 0.027 | -1.092 | 0.275 | 0.971 | 0.921 | 1.024 |
| Merchant_CategoryElectronics | 0.010 | 0.030 | 0.340 | 0.734 | 1.010 | 0.952 | 1.072 |
| Merchant_CategoryGroceries | 0.019 | 0.030 | 0.626 | 0.531 | 1.019 | 0.960 | 1.082 |
| Merchant_CategoryRestaurants | 0.042 | 0.030 | 1.395 | 0.163 | 1.043 | 0.983 | 1.107 |
| Merchant_CategoryTravel | 0.028 | 0.030 | 0.915 | 0.360 | 1.028 | 0.969 | 1.091 |
| Device_TypeMobile | -0.004 | 0.024 | -0.161 | 0.872 | 0.996 | 0.951 | 1.043 |
| Device_TypeTablet | 0.028 | 0.023 | 1.182 | 0.237 | 1.028 | 0.982 | 1.076 |
| Card_TypeDiscover | 0.006 | 0.027 | 0.202 | 0.840 | 1.006 | 0.953 | 1.061 |
| Card_TypeMastercard | -0.026 | 0.027 | -0.962 | 0.336 | 0.974 | 0.924 | 1.027 |
| Card_TypeVisa | -0.024 | 0.027 | -0.877 | 0.381 | 0.977 | 0.926 | 1.030 |
| Authentication_MethodOTP | 0.020 | 0.027 | 0.745 | 0.456 | 1.020 | 0.968 | 1.076 |
| Authentication_MethodPassword | 0.012 | 0.027 | 0.445 | 0.656 | 1.012 | 0.960 | 1.067 |
| Authentication_MethodPIN | -0.022 | 0.027 | -0.807 | 0.420 | 0.978 | 0.928 | 1.032 |
| Is_Weekend | 0.000 | 0.021 | -0.012 | 0.991 | 1.000 | 0.960 | 1.042 |
| IP_Address_Flag | 0.029 | 0.044 | 0.675 | 0.500 | 1.030 | 0.945 | 1.122 |
| Previous_Fraudulent_Activity | -0.005 | 0.032 | -0.168 | 0.867 | 0.995 | 0.934 | 1.059 |

To visualize the associations between categorical variables and fraud, Figure 11 shows a coefficient plot displaying the estimated odds ratios from the GAM. In this plot, each dot represents the estimated odds of fraud for a specific category level relative to the reference level, while the vertical lines (whiskers) show the 95% confidence intervals.

If a whisker crosses the line at 1, the effect is not statistically significant, meaning there is no clear difference in fraud odds compared to the reference. Dots above 1 indicate higher odds of fraud, while dots below 1 indicate lower odds.

In our dataset, the plot shows that none of the categorical variables (Transaction_Type, Merchant_Category, Device_Type, Card_Type, Authentication_Method, Is_Weekend, IP_Address_Flag, and Previous_Fraudulent_Activity) has a confidence interval that clearly excludes 1. This indicates that, for this synthetic dataset, fraud probability does not vary meaningfully across these categories, which is consistent with the relatively even distribution of categorical levels.
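The "interval excludes 1" rule can be checked programmatically. A minimal sketch with hypothetical numbers (only the Device_TypeTablet row is taken from Table 4; the strong term is invented for contrast):

```r
library(dplyr)

# Hypothetical tidy output on the log-odds scale (second row invented)
demo <- data.frame(
  term      = c("Device_TypeTablet", "Hypothetical_Strong_Term"),
  estimate  = c(0.028, 0.90),
  std.error = c(0.023, 0.10)
) %>%
  mutate(
    OR      = exp(estimate),
    OR_low  = exp(estimate - 1.96 * std.error),
    OR_high = exp(estimate + 1.96 * std.error)
  )

# A level differs significantly only when its OR interval excludes 1;
# applied to Table 4's actual output, this filter returns zero rows
demo %>% filter(OR_low > 1 | OR_high < 1)
```

Applied to the fitted `tidy_gam_cat` table above, the same filter keeps no rows, which is the programmatic counterpart of the visual check in Figure 11.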

Code
### ============================================================
###   Coefficient (Odds Ratio) Plot for Categorical Variables
### ============================================================

# Tidy up the model output
tidy_cat <- tidy(gam_cat, parametric = TRUE) %>%
  filter(term != "(Intercept)") %>%   # remove intercept
  mutate(
    OR = exp(estimate),                # Convert log-odds to odds ratio
    OR_low = exp(estimate - 1.96 * std.error),
    OR_high = exp(estimate + 1.96 * std.error)
  )

# Plot categorical coefficients as odds ratios
library(ggplot2)  # needed for the plot; not loaded earlier in this chunk
ggplot(tidy_cat, aes(x = reorder(term, OR), y = OR)) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "gray50") +  # reference line at OR = 1
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = OR_low, ymax = OR_high), width = 0.2) +
  coord_flip() +
  labs(
    title = "Figure 11. Categorical Variables and Fraud",
    x = "Category Level (Relative to Reference)",
    y = "Odds Ratio (Logistic GAM)"
  ) +
  theme_minimal(base_size = 12)

GAM Model for Risk Score

To examine the association between Risk_Score and fraud, we fit a generalized additive model (GAM) using a logit link function. Following the general GAM structure:

\[ g(\mu) = \alpha + s_1(X_1) + s_2(X_2) + \dots + s_p(X_p) \]

our model simplifies to a single predictor:

\[ \text{logit}(\Pr(\text{Fraud} = 1)) = \alpha + s(\text{Risk\_Score}) \]

where \(\alpha = 1.9109\) is the intercept. Because mgcv centers each smooth term to sum to zero over the data, the intercept represents the baseline log-odds of fraud when the smooth contribution is zero (roughly, at an average risk profile), not literally at Risk_Score = 0. The smooth term \(s(\text{Risk\_Score})\) captures the nonlinear relationship between Risk_Score and fraud probability. Its effective degrees of freedom of approximately 9 allow considerable flexibility in modeling nonlinear changes, and its high significance indicates a strong association with fraud.
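On the probability scale, the intercept inverts through the logistic link; a quick check using the reported estimate:

```r
# Baseline probability implied by the intercept (logit scale -> probability)
alpha <- 1.9109
plogis(alpha)   # 1 / (1 + exp(-alpha)), approximately 0.87
```

This is the source of the "approximately 87%" baseline quoted with Figure 12 below.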

Code
# Load packages
library(mgcv)
library(ggplot2)
library(dplyr)
library(broom)

# Read in your data
fraud_data <- read.csv("synthetic_fraud_dataset.csv")

# Make sure Fraud_Label is numeric (0 = legit, 1 = fraud)
fraud_data <- fraud_data %>%
  mutate(Fraud_Label = as.numeric(Fraud_Label))

# Fit the Generalized Additive Model (GAM)
risk_gam <- gam(Fraud_Label ~ s(Risk_Score),
                data = fraud_data,
                family = binomial(link = "logit"))

# Tidy model summary (clean output)
gam_summary <- tidy(risk_gam, parametric = TRUE)

# Display nice summaries
knitr::kable(gam_summary, caption = "Parametric Terms in GAM Model")
Parametric Terms in GAM Model

| term | estimate | std.error | statistic | p.value |
|------|----------|-----------|-----------|---------|
| (Intercept) | 1.910911 | 0.1023972 | 18.66175 | 0 |

Note: The intercept represents the baseline log-odds of fraud in the absence of a Risk_Score effect. It is highly significant, indicating a reliable baseline estimate.

Code
# The packages, data, and risk_gam model were already loaded and fit in the
# previous chunk, so here we only extract the smooth-term summary
smooth_summary <- tidy(risk_gam, parametric = FALSE)

knitr::kable(smooth_summary, caption = "Smooth Terms in GAM Model")
Smooth Terms in GAM Model

| term | edf | ref.df | statistic | p.value |
|------|-----|--------|-----------|---------|
| s(Risk_Score) | 8.993578 | 8.999965 | 1841.745 | 0 |

Note: The smooth term has ~9 effective degrees of freedom, allowing the GAM to flexibly model nonlinear changes in fraud probability. Its high significance indicates a strong association between Risk_Score and fraud.
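One caveat worth checking: the default thin-plate basis in s() uses k = 10, which caps the edf at 9, so an edf of ~9 means the smooth is using essentially all of its allotted flexibility. A self-contained sketch on simulated data shows the comparison (the real check would run gam.check() or k.check() on risk_gam; the simulated variables are stand-ins):

```r
library(mgcv)
set.seed(1)

# Simulated binary outcome with a sharp nonlinearity, standing in for
# the Risk_Score effect (real data: fraud_data / risk_gam)
x <- runif(3000)
y <- rbinom(3000, 1, plogis(-2 + 60 * pmax(x - 0.75, 0)))
dat <- data.frame(x = x, y = y)

m10 <- gam(y ~ s(x, k = 10), data = dat, family = binomial, method = "REML")
m20 <- gam(y ~ s(x, k = 20), data = dat, family = binomial, method = "REML")

summary(m10)$edf  # if this sits near k - 1 = 9, the basis may be too small
summary(m20)$edf  # with a larger basis, the edf can grow if the data demand it
```

If enlarging k materially changes the edf or the fitted curve, the original basis was constraining the fit; if not, k = 10 was adequate.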

Risk Score GAM Curve

Figure 12 visualizes predictions from a generalized additive model (GAM) used to examine how Risk_Score influences the probability of fraud. The parametric term (intercept) corresponds to a baseline fraud probability of approximately 87% in this dataset (when the centered smooth contributes nothing) and is statistically significant (p ≈ 0). The smooth term s(Risk_Score) captures the potentially nonlinear relationship between risk score and fraud probability, with effective degrees of freedom of ~9 and a highly significant p-value (≈ 0), indicating that Risk_Score is a strong predictor of fraud.

The accompanying figure visualizes the predicted probabilities: the steelblue dots represent individual transactions in the dataset (the raw predicted probabilities for each transaction), while the red line represents the fitted GAM curve showing the estimated trend of fraud probability across the range of Risk_Score values. Fraud probability remains relatively steady across low-to-mid Risk_Score values, with minor nonlinear fluctuations, but increases sharply around a Risk_Score of approximately 0.75, highlighting a threshold effect. Together, these results demonstrate that higher risk scores are strongly associated with increased likelihood of fraud, and the GAM effectively captures both subtle patterns and nonlinear changes in this relationship.

Code
# Load packages
library(mgcv)
library(ggplot2)
library(dplyr)
library(broom)

# Predicted probabilities
fraud_data <- fraud_data %>%
  mutate(predicted_prob = predict(risk_gam, type = "response"))

# Visualization: Predicted probability by Risk Score with descriptive legend
ggplot(fraud_data, aes(x = Risk_Score, y = predicted_prob)) +
  geom_point(aes(color = "Raw Data"), alpha = 0.3) +   # raw data points
  geom_smooth(aes(color = "Fitted GAM Curve"), se = TRUE, linewidth = 1) +  # refits a smoother to the predictions; closely tracks the GAM itself
  scale_color_manual(
    name = "Legend",
    values = c(
      "Raw Data" = "steelblue",
      "Fitted GAM Curve" = "red"
    )
  ) +
  labs(
    title = "Figure 12. Fraud Probability vs. Risk Score",
    x = "Risk Score",
    y = "Predicted Probability of Fraud"
  ) +
  theme_light(base_size = 13) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Model Diagnostics & Performance Metrics

Confusion Matrix

Figure 13 shows the confusion matrix for the GAM Fraud Detector, providing a detailed snapshot of the model’s classification performance at a threshold of 0.5. In this matrix, the rows represent the model’s predicted labels and the columns represent the actual outcomes, with 0 indicating non-fraud and 1 indicating fraud:

Code
##Confusion Matrix
## Install once:
# install.packages(
#   c("mgcv", "pROC", "caret", "dplyr", "ggplot2", "scales"),
#   repos = "https://cloud.r-project.org"
# )

library(mgcv)
library(pROC)
library(caret)
library(dplyr)
library(ggplot2)
library(scales)

data <- read.csv("synthetic_fraud_dataset.csv", stringsAsFactors = FALSE)
# 2. Data Preprocessing
# ------------------------------------------------

# Convert the target variable and categorical predictors to factors
data$Fraud_Label <- factor(data$Fraud_Label, levels = c(0, 1))
data$Is_Weekend <- factor(data$Is_Weekend)
data$Previous_Fraudulent_Activity <- factor(data$Previous_Fraudulent_Activity)
data$Device_Type <- factor(data$Device_Type)
data$Card_Type <- factor(data$Card_Type)

# ------------------------------------------------
# 3. Data Splitting (70% Train, 30% Test)
# ------------------------------------------------

set.seed(42) # For reproducibility
train_index <- createDataPartition(data$Fraud_Label, p = 0.7, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]

# ------------------------------------------------
# 4. GAM Fitting (Logistic Model)
# ------------------------------------------------

# Use smooth terms (s()) for continuous variables to capture non-linear fraud patterns.
gam_model <- gam(
  Fraud_Label ~ s(Transaction_Amount) +
    s(Account_Balance) +
    s(Risk_Score) +
    s(Transaction_Distance) +
    Avg_Transaction_Amount_7d +
    Daily_Transaction_Count +
    Card_Age +
    Is_Weekend +
    Previous_Fraudulent_Activity +
    Device_Type +
    Card_Type,
  data = train_data,
  family = binomial(link = "logit"), # Logistic GAM for binary classification
  method = "REML"
)


# ------------------------------------------------
# 5. Prediction and AUC Calculation
# ------------------------------------------------

test_probabilities <- predict(gam_model, newdata = test_data, type = "response")

# Generate the ROC curve
roc_obj <- roc(test_data$Fraud_Label, test_probabilities)
auc_value <- auc(roc_obj)


# ------------------------------------------------
# 6. Confusion Matrix and Balanced Accuracy
# ------------------------------------------------

# Convert probabilities to classes (using 0.5 threshold)
predicted_classes <- factor(ifelse(test_probabilities > 0.5, 1, 0), levels = c(0, 1))
cm <- confusionMatrix(predicted_classes, test_data$Fraud_Label, positive = "1")
balanced_accuracy <- cm$byClass["Balanced Accuracy"]

# Prepare data for plotting
cm_table <- as.data.frame(cm$table)
names(cm_table) <- c("Pred", "Ref", "Freq")

cm_table <- cm_table %>%
  group_by(Ref) %>%
  mutate(Pct = Freq/sum(Freq)*100, Label = paste0(Freq, "\n(", round(Pct,1), "%)"))

# Create the heatmap plot
p_cm <- ggplot(cm_table, aes(x = Ref, y = Pred, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Label), color = "white", size = 6, fontface = "bold") +
  scale_fill_gradient(low = "#2c7bb6", high = "#d7191c") +
  labs(title = "Figure 13. Confusion Matrix", x = "Actual (Reference)", y = "Predicted") +
  theme_minimal() +
  coord_fixed()
print(p_cm)


Note:

  • True Positives (TP = 2,318): Fraudulent transactions correctly identified by the model.

  • True Negatives (TN = 10,102): Legitimate transactions correctly classified as non-fraud.

  • False Positives (FP = 77): Legitimate transactions incorrectly flagged as fraud.

  • False Negatives (FN = 2,502): Fraudulent transactions missed by the model.

Ideally, TP and TN should be high, and FP and FN should be low. Here the model is very good at recognizing legitimate transactions (TN = 10,102) and rarely flags them incorrectly (FP = 77), giving a precision of about 97% among flagged cases. However, it misses slightly more fraud than it catches (FN = 2,502 vs. TP = 2,318), a recall of roughly 48% at the 0.5 threshold. The GAM therefore captures meaningful patterns but is conservative at this threshold; lowering the cutoff would trade some of its excellent precision for higher fraud recall.
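The summary rates follow directly from the four cell counts; a quick re-derivation:

```r
# Derive headline metrics from the confusion-matrix counts reported above
TP <- 2318; TN <- 10102; FP <- 77; FN <- 2502

precision   <- TP / (TP + FP)                                 # ~0.968
recall      <- TP / (TP + FN)                                 # sensitivity, ~0.481
specificity <- TN / (TN + FP)                                 # ~0.992
f1          <- 2 * precision * recall / (precision + recall)  # ~0.643
balanced_accuracy <- (recall + specificity) / 2               # ~0.737

round(c(precision = precision, recall = recall, specificity = specificity,
        F1 = f1, balanced_accuracy = balanced_accuracy), 3)
```

Precision near 97% alongside recall near 48% quantifies the asymmetry visible in the matrix: false alarms are rare, but roughly half of fraud cases slip through at the 0.5 cutoff.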

ROC Curve

Figure 14 shows the ROC Curve for the GAM Fraud Detector, which summarizes the model’s classification performance across all possible threshold values. The curve plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity), showing how well the model distinguishes fraudulent from non-fraudulent transactions. The curve sits clearly above the dashed diagonal line representing random guessing. The Area Under the Curve (AUC) is 0.73, meaning the model ranks a randomly selected fraudulent transaction higher than a randomly selected non-fraudulent one about 73% of the time, which indicates moderate discriminative ability. Choosing an operating point along this curve involves a trade-off between detecting more fraud (higher sensitivity) and reducing false alarms (lower false positive rate).

This figure indicates that the GAM captures meaningful patterns in the data and discriminates between fraudulent and non-fraudulent transactions well above chance. Because the AUC is computed over all thresholds rather than a single cutoff, it is also a robust summary under class imbalance and a convenient way to communicate predictive accuracy to stakeholders.
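When a single operating threshold must be chosen from the ROC curve, a common heuristic is Youden's J (sensitivity + specificity - 1). A self-contained sketch using pROC's coords() with simulated scores (in practice the report's roc_obj would replace roc_demo, and the resulting threshold is data-dependent):

```r
library(pROC)
set.seed(7)

# Simulated scores stand in for the model's test-set probabilities
labels <- rbinom(5000, 1, 0.2)
scores <- ifelse(labels == 1, rnorm(5000, mean = 1), rnorm(5000, mean = 0))
roc_demo <- roc(labels, scores, quiet = TRUE)

# Youden's J picks the threshold maximizing sensitivity + specificity - 1
best <- coords(roc_demo, x = "best", best.method = "youden",
               ret = c("threshold", "sensitivity", "specificity"))
best
```

In a cost-sensitive fraud setting, the Youden threshold is only a starting point; the cutoff should ultimately reflect the relative costs of missed fraud and false alarms.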

Code
# Install once: 
#install.packages(c("mgcv","pROC","caret","dplyr","ggplot2","scales"))

library(mgcv)
library(pROC)
library(caret)
library(dplyr)
library(ggplot2)
library(scales)

## ROC Curve 

# Load the dataset
df <- read.csv("synthetic_fraud_dataset.csv", stringsAsFactors = FALSE)

# ------------------------------------------------
# 2. Data Preprocessing and Splitting
# ------------------------------------------------
df <- df %>%
  mutate(
    across(c(Transaction_Type, Device_Type, Location, Merchant_Category,
             Card_Type, Authentication_Method, IP_Address_Flag,
             Previous_Fraudulent_Activity, Is_Weekend), factor),
    Fraud_Label = factor(Fraud_Label, levels = c(0,1))
  )

set.seed(123)
train_idx <- createDataPartition(df$Fraud_Label, p = .70, list = FALSE)
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# ------------------------------------------------
# 3. Fit Simple GAM (Focusing on Key Predictors)
# ------------------------------------------------
gam_mod <- gam(
  Fraud_Label ~
    s(Risk_Score, k = 10) +                  
    s(Transaction_Amount, k = 10) +
    s(Transaction_Distance, k = 10) +
    Previous_Fraudulent_Activity +
    Device_Type +
    Card_Type +
    Is_Weekend,
  family = binomial, 
  data = train,
  method = "REML",
  select = TRUE       
)

# ------------------------------------------------
# 4. Prediction and ROC/AUC Calculation
# ------------------------------------------------
test_prob <- predict(gam_mod, test, type = "response")
roc_obj <- roc(test$Fraud_Label, test_prob)
auc_val <- auc(roc_obj)

# ------------------------------------------------
# 5. ROC Curve Generation (Should now save to your setwd() folder)
# ------------------------------------------------

# Prepare data for ggplot2 plotting
roc_df <- data.frame(fpr = 1 - roc_obj$specificities, tpr = roc_obj$sensitivities)

# Create the ggplot2 visualization
p_roc <- ggplot(roc_df, aes(x = fpr, y = tpr)) +
  geom_ribbon(aes(ymin = 0, ymax = tpr), fill = "#2c7bb6", alpha = 0.2) +
  geom_line(color = "#2c7bb6", linewidth = 1.5) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
  labs(title = "Figure 14. ROC Curve", 
       subtitle = paste("GAM Model AUC =", round(auc_val, 4)),
       x = "False Positive Rate (1 - Specificity)", 
       y = "True Positive Rate (Sensitivity)") +
  theme_minimal(base_size = 14) +
  coord_fixed() +
  scale_x_continuous(labels = scales::percent, breaks = seq(0,1,0.2)) +
  scale_y_continuous(labels = scales::percent, breaks = seq(0,1,0.2))

print(p_roc)

Conclusion

The GAM demonstrated solid classification performance. Key findings include:

  • Risk_Score – the dominant predictor, with fraud probability sharply increasing around a score of 0.75, showing a clear threshold effect.

  • Transaction_Amount – shows a moderate positive effect on fraud probability, with larger transactions more likely to be fraudulent.

Model performance metrics support these findings. Out of the test transactions, 2,318 fraudulent transactions were correctly identified (true positives), 10,102 legitimate transactions were correctly classified (true negatives), 77 legitimate transactions were incorrectly flagged (false positives), and 2,502 fraudulent transactions were missed (false negatives). The ROC curve had an area under the curve (AUC) of 0.73, indicating moderate discriminative ability.

Several limitations were noted. The dataset is synthetic and evenly distributed, which reduced the availability of strong predictors. GAMs are sensitive to class imbalance, which may affect recall for rare fraud cases. While GAMs are interpretable, more complex machine learning methods such as gradient boosting could achieve higher predictive accuracy.
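One standard mitigation for the class-imbalance sensitivity noted above is to upweight the minority class when fitting, since mgcv's gam() accepts a weights argument. A sketch on simulated data with an illustrative inverse-prevalence weighting scheme (variable names are stand-ins, not the report's, and the scheme is untuned):

```r
library(mgcv)
set.seed(42)

# Simulated imbalanced data standing in for the fraud set
n <- 4000
risk  <- runif(n)
fraud <- rbinom(n, 1, plogis(-4 + 4 * risk))   # fraud is the rare class
dat   <- data.frame(fraud = fraud, risk = risk)

# Inverse-prevalence weights: rare fraud cases count more in the likelihood
w <- ifelse(dat$fraud == 1, 1 / mean(dat$fraud == 1), 1 / mean(dat$fraud == 0))

m_w <- gam(fraud ~ s(risk), data = dat, weights = w,
           family = binomial, method = "REML")
mean(predict(m_w, type = "response"))  # above the raw prevalence: minority upweighted
```

Weighting shifts predicted probabilities upward for the rare class, improving recall at a given threshold; down-sampling, up-sampling, or threshold tuning are alternative remedies with similar intent.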

Despite these limitations, the GAM effectively identified which transaction characteristics are related to fraud, providing insights that are easy to interpret and explain. Future work could include integrating the model into real-time fraud detection, combining it with other machine learning methods for higher accuracy, and using streaming data to capture changing fraud patterns over time.

This study shows a clear progression from theory to practice. The model fit the data well, and the results show how Risk_Score and Transaction_Amount influence fraud probability. These findings provide understandable and auditable risk indicators that can guide fraud alerts, review priorities, and policy decisions.

Finally, the project highlights practices that support both accuracy and accountability. Proper feature selection, stratified sampling, and attention to class imbalance are important. Thresholds should reflect cost-sensitive trade-offs and be updated regularly. GAM outputs, including the smooth curves, help stakeholders understand the results and monitor model performance over time.

References

Agarwal, A., Frosst, N., Zhang, X., Caruana, R., & Hinton, G. (2021). Neural additive models: Interpretable machine learning with neural networks. Advances in Neural Information Processing Systems, 34, 4694–4706. https://arxiv.org/abs/2004.13912
Ashar, S. (2024). Fraud detection transactions dataset. Kaggle. https://www.kaggle.com/datasets/samayashar/fraud-detection-transactions-dataset
Brossart, D. F., Clay, D. L., & Willson, V. (2015). Detecting contaminated birthdates using generalized additive models. BMC Bioinformatics, 16(185), 1–9. https://doi.org/10.1186/s12859-015-0636-0
Chang, J., Guo, R., Zhao, L., & Liu, H. (2022). Interpretable graph learning with graph neural additive models. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 118–128. https://doi.org/10.1145/3534678.3539310
Detmer, A. (2025). Ecological thresholds and generalized additive models. Journal of Ecology Research, 45(3), 215–230.
DGAM. (2021). Dynamic generalized additive models (DGAMs) for forecasting. PeerJ, 9, e10974. https://doi.org/10.7717/peerj.10974
FGAM. (2015). Functional generalized additive models. Statistica Sinica, 25(2), 533–558.
GAMformer. (2023). GAMformer: In-context learning for generalized additive models. arXiv preprint arXiv:2306.04301. https://arxiv.org/abs/2306.04301
Gam.hp. (2020). Evaluating the relative importance of predictors in generalized additive models using the gam.hp r package. Comprehensive R Archive Network (CRAN). https://cran.r-project.org/package=gam.hp
Guisan, A., Edwards, T. C., & Hastie, T. (2002). Generalized linear and generalized additive models in studies of species distributions: Setting the scene. Ecological Modelling, 157(2–3), 89–100. https://doi.org/10.1016/S0304-3800(02)00204-1
HalDa, C. (2012). Generalized linear models and generalized additive models (lecture notes, chapter 13). Department of Statistics, Carnegie Mellon University. http://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/13/lecture-13.pdf
Hanagandi, S., Dhar, M., & Buescher, D. (2023). Enhancing credit card fraud detection with regularized generalized linear models: A comparative analysis of down-sampling and up-sampling techniques. International Journal of Innovative Science and Research Technology, 8(9), 1533–1539.
Hastie, T., & Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1(3), 297–310. http://www.jstor.org/stable/2245459
Hastie, T., & Tibshirani, R. (1990). Generalized additive models. Chapman & Hall/CRC.
Lou, Y., Caruana, R., & Gehrke, J. (2012). Intelligible models for classification and regression. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 150–158. https://doi.org/10.1145/2339530.2339556
Miller, D. L. (2025). Gam model – fraud detection in darknet markets using generalized additive models. Figshare. https://doi.org/10.6084/m9.figshare.28618408
Tragouda, K., Papadopoulos, T., & Stefanou, A. (2024). Identification of fraudulent financial statements through a multi-label classification approach. Intelligent Systems in Accounting, Finance and Management. https://doi.org/10.1002/isaf.225
White, L. F., Jiang, W., Ma, Y., So-Armah, K., Samet, J. H., & Cheng, D. M. (2020). Tutorial in biostatistics: The use of generalized additive models to evaluate alcohol consumption as an exposure variable. Drug and Alcohol Dependence, 209, 107944. https://doi.org/10.1016/j.drugalcdep.2020.107944
Wood, S. N. (2017). Generalized additive models: An introduction with R (2nd ed.). Chapman & Hall/CRC.
Wood, S. N. (2025). Mgcv: Mixed GAM computation vehicle with automatic smoothness estimation (R package version 1.9-1). Comprehensive R Archive Network (CRAN). https://cran.r-project.org/package=mgcv
Zhang, Y., Li, X., & Chen, W. (2025). Graph-based approaches for telecom fraud detection: A comparison with generalized additive models. Journal of Computational Intelligence in Finance, 38(2), 155–170.
Zhu, M., Gong, Y., Xiang, Y., Yu, H., & Huo, S. (2023). Utilizing GANs for fraud detection: Model training with synthetic transaction data. ResearchGate. https://www.researchgate.net/publication/373914456
Zlaoui, K. (2018). A (very) quick introduction to GAMs. Medium. https://towardsdatascience.com/a-very-quick-introduction-to-gams-64f0c1f59f92